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EVALUATING PROBABILITY FORECASTS 

By Tze Leung Lai 1 , Shulamith T. Gross 2 and David Bo Shen 

Stanford University, Baruch College/CUNY and UBS 

Probability forecasts of events are routinely used in climate pre- 
dictions, in forecasting default probabilities on bank loans or in esti- 
mating the probability of a patient's positive response to treatment. 
Scoring rules have long been used to assess the efficacy of the fore- 
cast probabilities after observing the occurrence, or nonoccurrence, 
of the predicted events. We develop herein a statistical theory for 
scoring rules and propose an alternative approach to the evaluation 
of probability forecasts. This approach uses loss functions relating the 
predicted to the actual probabilities of the events and applies mar- 
tingale theory to exploit the temporal structure between the forecast 
and the subsequent occurrence or nonoccurrence of the event. 

1. Introduction. Probability forecasts of future events are widely used 
in diverse fields of application. Oncologists routinely predict the probability 
of a cancer patient's progression- free survival beyond a certain time horizon 
[Hari et al. (2009)]. Economists give the probability forecasts of an economic 
rebound or a recession by the end of a fiscal year. Banks are required by 
regulators assessing their capital requirements to predict periodically the 
risk of default of the loans they make. Engineers are routinely called upon 
to predict the survival probability of a system or infrastructure beyond five 
or ten years; this includes bridges, sewer systems and other structures. Fi- 
nally, lawyers also assess the probability of particular trial outcome [Fox and 
Birke (2002)] in order to determine whether to go to trial or settle out of 
court. This list would not be complete without mentioning the field that 
is most advanced in its daily probability predictions, namely meteorology. 
In the past 60 years, remarkable advances in forecasting precipitation prob- 
abilities, temperatures, and rainfall amounts have been made in terms of 
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breadth and accuracy. Murphy and Winkler (1984) provide an illuminating 
history of the US National Weather Service's transition from nonprobabilis- 
tic to probability predictions and its development of reliability and accuracy 
measures for these probability forecasts. Accuracy assessment is difficult to 
carry out directly because it requires comparing a forecaster's predicted 
probabilities with the actual but unknown probabilities of the events under 
study. Reliability is measured using "scoring rules," which are empirical dis- 
tance measures between repeated predicted probabilities of an event, such as 
having no rain the next day, and indicator variables that take on the value 
1 if the predicted event actually occurs, and otherwise; see Gneiting and 
Raftery (2007), Gneiting, Balabdaoui and Raftery (2007) and Ranjan and 
Gneiting (2010) for recent reviews and developments. 

To be more specific, a scoring rule for a sequence of n probability forecasts 
Pi, i = 1, . . . , n, is the average score n~ l Y17=i L(Yi,pi), where Yi = 1 or 
according to whether the ith event Ai actually occurs or not. An example 
is the widely used Brier's score L(y,p) = (y — p) 2 [Brier (1950)]. Noting 
that the Yi are related to the actual but unknown probability pi via Yi ~ 
Bernoulli (pi), Cox (1958) proposed to evaluate how well the pi predict pi by 
using the estimates of (/3i,/32) in the regression model 

(1.1) logit(p i ) = /3 1 + /3 2 logit(p i ) 

and developed a test of the null hypothesis (Pi,^) = (0,1), which corre- 
sponds to perfect prediction. Spiegelhalter (1986) subsequently proposed a 
test of the null hypothesis Hq :pi=pi for all i = 1 , . . . , re, based on a standard- 
ized form (under Hq) of Brier's score. A serious limitation of this approach is 
the unrealistic benchmark of perfect prediction to formulate the null hypoth- 
esis, so significant departures from it are expected when n is large, and they 
convey little information on how well the pi predict pi. Another limitation 
is the implicit assumption that the pi are independent random variables, 
which clearly is violated since pi usually involves previous observations and 
predictions. 

Seillier-Moiseiwitsch and Dawid (1993) have developed a hypothesis test- 
ing approach that removes both limitations in testing the validity of a se- 
quence of probability forecasts. The forecaster is modeled by a probability 
measure under which the conditional probability of the occurrence of Ai 
given the cr-field Qi-\ generated by the forecaster's information set prior to 
the occurrence of the event is 7Tj. In this model, the forecaster uses pi = Hi as 
the predicted probability of Ai. As pointed out earlier by Dawid (1982), this 
model fits neatly into de Finetti's (1975) framework in which "the coherent 
subjectivist Bayesian can be shown to have a joint probability distribution 
over all conceivably observable quantities," which is represented by the prob- 
ability measure IT in the present case. To test if IT is "empirically valid" based 
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on the observed outcomes Yi, . . . , Y n , Seillier-Moiseiwitsch and Dawid (1993) 
consider the null hypothesis Hq that "the sequence of events is generated by 
the same joint distribution from which the forecasts are constructed." Under 
this null hypothesis, Yli=i&(Yi — 7Tj),7i > 1, is a martingale with respect to 
the filtration {Gi} when £j is t/j_i-measurable for all i. Assuming certain 
regularity conditions on £j, they apply the martingale central limit theorem 
to show that as n — > oo, 



this model of a coherent forecaster, Seillier-Moiseiwitsch and Dawid (1993) 
have made use of (1.2) to construct various tests of Hq. One such test, de- 
scribed at the end of their Section 6, involves another probability forecast 
p' i: which is "based on no more information" to define £j, so that a signifi- 
cantly large value of the test statistic can be used to reject Hq in favor of 
the alternative forecasting model or method. 

Hypothesis testing has been extended from testing perfect prediction or 
empirical validity of a sequence of probability forecasts to testing equal- 
ity of the predictive performance of two forecasts; see Redelmeier, Bloch 
and Hickam (1991) who extended Spiegelhalter's approach mentioned above. 
Testing the equality of predictive performance, measured by some loss func- 
tion of the predictors and the realized values, of two forecasting models or 
methods has attracted much recent interest in the econometrics literature, 
which is reviewed in Section 6.2. In this paper we develop a new approach 
to statistical inference, which involves confidence intervals rather than sta- 
tistical tests of a null hypothesis asserting empirical validity of a forecasting 
model or method, or equal predictive performance for two forecasting models 
or methods. The essence of our approach is to evaluate probability forecasts 
via the average loss L n = n~ l Y^7=i L(pi,Pi), where pi is the actual but un- 
known probability of the occurrence of Ai. When L is linear in pi, L(Yi,pi) 
is an unbiased estimate of L(pi,pi) since E(Yi\pi) =pi. We show in Section 
2, where an overview of loss functions and scoring rules is also given, that 
even for L that is nonlinear in pi there is a "linear equivalent" which carries 
the same information as L for comparing different forecasts. In Section 3 we 
make use of this insight to construct inferential procedures, such as confi- 
dence intervals, for the average loss L n under certain assumptions and for 
comparing the average losses of different forecasts. 

Note that we have used E to denote expectation with respect to the 
actual probability measure P, under which Ai occurs with probability pi 
given the previous history represented by the cr-field Gi-i, and that we have 
used II to denote the probability measure assumed by a coherent Bayesian 



under Hq, where 



(1.2) 
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forecaster whose probability of occurrence of A{ given Qi—\ is 7Tj. Because 
7Tj = f>i for a coherent Bayesian forecaster, Seillier-Moiseiwitsch and Dawid 
(1993) are able to use (1.2) to test the null hypothesis of empirical validity 
of II in the sense that Eu(Yi\^i-i) = Pi> where Ejj denotes expectation with 
respect to the measure II. Replacing II by P is much more ambitious, but it 
appears impossible to derive the studentized version of the obvious estimate 
L n = n -1 Ya=i L(Yi,pi) and its sampling distribution under P to perform 
inference on L n . We address this difficulty in several steps in Section 3. 
First we consider in Section 3.1 the case in which L(p,p) is linear in p 
and make use of the martingale central limit theorem to prove an analog 
of (1.2) with pi in place of 7Tj and & = L(l,pi) — L(0,pi). Whereas 7r,; = 
Pi under II, the pi associated with P are unknown parameters that need 
to be estimated. Postponing their estimation to Section 3.4, we first use 
the simple bound pi(l — pi) < 1/4 to obtain confidence intervals for L n by 
making use of this analog of (1.2). In Section 3.2 we consider the problem 
of comparing two probability forecasts via the difference of their average 
losses, and make use of the idea of linear equivalents introduced in Section 
2 to remove the assumption of L(p,p) being linear in p when we consider 
A n = n _1 ^2™ = i{L(pi,pt) — L(pi,p'-)}. A variant of A n , called Winkler's skill 
score in weather forecasting, is considered in Section 3.3. In Section 3.4, we 
return to the problem of estimating pi(l — Pi). Motivated by applications 
in which the forecasts are grouped into "risk buckets" within which the 
Pi can be regarded as equal, Section 3.4 provides two main results on this 
problem. The first is Theorem 3, which gives consistent estimates of the 
asymptotic variance of A n , or of L n when L(p,p) is linear in p, in the 
presence of risk buckets with each bucket of size 2 or more. The second, given 
in Theorem 4, shows that in this bucket model it is possible to adjust the 
Brier score to obtain a consistent and asymptotically normal estimate of the 
average squared error loss L n = n _1 Y17=i(Pi ~Pi) 2 - Theorem 4 also provides 
a consistent estimate of the asymptotic variance of the adjusted Brier score 
when the bucket size is at least 3. In Section 3.5 we develop an analog of 
Theorem 3 for the more general setting of "quasi-buckets," for which the pi 
within each bin (quasi-bucket) need not be equal. These quasi-buckets arise 
in "reliability diagrams" in the meteorology literature. Theorem 5 shows that 
the confidence intervals obtained under an assumed bucket model are still 
valid but tend to be conservative if the buckets are actually quasi-buckets. 
The proofs of Theorems 4 and 5 are given in Section 5. 

Section 4 gives a simulation study of the performance of the proposed 
methodology, and some concluding remarks and discussion are given in Sec- 
tion 7. In Section 6 we extend the Yi from the case of indicator variables 
of events to more general random variables by modifying the arguments in 
Section 5, and also show how the methods and results in Sections 3.2 and 3.4 
can be used to address related problems in the econometrics literature on 
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the expected difference in scores between two forecasts, after a brief review 
of that literature that has become a major strand of research in economic 
forecasts. 

2. Scoring rules and associated loss functions. Instead of defining a scor- 
ing rule via L (which associates better forecasts with smaller values of L), 
Gneiting and Raftery (2007) and others assign higher scores to better fore- 
casts; this is tantamount to using — L instead of L in defining a scoring 
rule. More generally, considering p and its forecast p as probability mea- 
sures, they call a scoring rule S proper relative to a class V of probability 
measures if E p S(Z,p) > E p S(Z,p) for all p and p belonging to V, where 
Z is an observed random vector (generated from p) on which scoring is 
based. For the development in the subsequent sections, we find it more con- 
venient to work with L instead of —L and restrict to Z = (Y\, . . . ,Y n ) so 
that S(Z, (pi,... ,Pn)) = -n' 1 Y%=1 L ( Y i,Pi)- 

The function L in the scoring rule n _1 Y17=i L(Yi->Pi) measures the close- 
ness of the probability forecast pi of event i before the indicator variable Yj 
of the event is observed. We can also use L as a loss function in measuring 
the accuracy of pi as an estimate of the probability pi of event i. Besides 
the squared error loss L(p,p) = (p — p) 2 used in Brier's score, another widely 
used loss function is the Kullback-Leibler divergence, 

(2.1) L(p,p) =p\og(p/p) + (1 - p) log[(l - p)/(l - p)}, 

which is closely related to the log score introduced by Good (1952), as shown 
below. More general loss functions of this type are the Bregman divergences; 
see Section 3.5.4 of Griinwald and Dawid (2004) and Section 2.2 of Gneiting 
and Raftery (2007). 

We call a loss function L(p,p) a linear equivalent of the loss function 
L(p,p) if L(p,p) is a linear function of p and 

(2.2) L(p,p) — L(p,p) does not depend on p. 

For example, L(p,p) = —2pp + p 2 is a linear equivalent of the squared error 
loss (p — p) 2 used by Brier's score. A linear equivalent L of the Kullback- 
Leibler divergence (2.1) is given by —L(p,p) =plog(p) + (1 — p)log(l —p). 
This is the conditional expected value (given p) of Y log(j5) + (1 — Y) log(l — 
p), which is Good's log score. Since the probability pi is determined before 
the Bernoulli random variable Yi is observed, 

(2.3) E{L(Yi,p i )\p i ,p i }=p i L(l,p i ) + (l- P i)L(0,p i ). 

Therefore the conditional expected loss of a scoring rule L(Y,p) yields a loss 
function 



(2.4) 



L(p,p) = {L(l,p) - L(0,p)}p + L(0,p) 
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that is linear in p. For example, the absolute value scoring rule L(Y,p) = 
\Y — p\ is associated with L(p,p) = p(l — p) + (1 — p)p that is linear in each 
argument. Using the notation (2.4), the scoring rule L(Y,p) is proper if 
L(p,p) < L(p,p) for all p,p £ [0, 1], and is strictly proper if mino<p<i L(p,p) 
is uniquely attained at p = p. The scoring rule \Y — p\, therefore, is not 
proper; moreover, \p — p\ does not have a linear equivalent. 

3. A new approach to evaluation of probability forecasts. In this sec- 
tion we first consider the evaluation of a sequence of probability forecasts 
pi , . . . , p n based on the corresponding sequence of indicator variables 
Yi,...,Y n that denote whether the events actually occur or not. Whereas 
the traditional approach to evaluating p = (pi, . . . ,p n ) uses the scoring rule 
n _1 EILi L(Yi, pi), we propose to evaluate p via 

n 

(3.1) L n = n~ i y^L(pi,p i ), 

i=i 

where L is a loss function, and pi is the actual probability of the occurrence 
of the ith event. Allowing the actual probabilities pi to be generated by 
a stochastic system and the forecast pk to depend on an information set 
Gk-i that consists of the event and forecast histories and other covariates 
before Y^ is observed, the conditional distribution of Y{ given Qt-± and pi is 
Bernoulli(pi), and therefore 

(3.2) P(Y i = l\g i ^ 1 ,p i )=p i . 

3.1. Linear case. In view of (3.2), an obvious estimate of the unknown 
Pi is Yj. Suppose L(p,p) is linear in p, as in the case of linear equivalents of 
general loss functions. Combining this linearity property with (3.2) yields 

(3.3) E{L(Y i ,p i )\Q i ^ 1 ,pi} = L(pi,pi), 

and therefore L(Yi,pi) — L(pi,pi) is a martingale difference sequence with re- 
spect to {Fi}, where Ti-\ is the u-field generated by Qi-\ and pi, . . . ,pi. Let 
di = L(Yi,pi) — L(pi,pi). Since L(y,p) is linear in y, we can write L(y,p) = 
a(p)y + b(p). Setting y = and y = 1 in this equation yields a(p) = L(l,p) — 
L(0,p). Moreover, di = a{pi){Yi — Pi)- Since Yi\Ti ~ Bernoulli(pi) and pi is 
Ti- 1 -measurable, 

(3.4) E(d\\T^ x ) = a 2 (pi) Pl (l - Pi). 

By (3.4), Ei^C^l^i-O = £i ~ L(0,pi)} 2 Pl (l - Pi ) = 0(n) a.s., 
and therefore n 1 EILi di ^ a.s. by the martingale strong law 
[Williams (1991), Section 12.14] proving L n — L n — > a.s. Moreover, if 
n _1 Ei E(di l-^-i ) = a n converges in probability to a nonrandom positive 
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constant, then \fri\L n — L n )/a n has a limiting standard normal distribution 
by Theorem 1 of Seillier-Moiseiwitsch and Dawid (1993). Summarizing, we 
have the following. 

Theorem 1. Suppose L(p,p) is linear in p. Let L n = n" 1 Y17=i L{Yi,pi), 
and define L n by (3.1). Letting 

n 

(3.5) al = rT 1 - L(0,k)} 2 Pi (1 - Pi ), 

i=l 

assume that o~\ = 0(1) with probability 1. Then L n — L n converges to 
with probability 1. Lf converges in probability to some nonrandom positive 
constant, then y/n(L n — L n ) / o~ n has a limiting standard normal distribution. 

To apply Theorem 1 to statistical inference on L n , one needs to address 
the issue that a\ involves the unknown pi . As noted in the third paragraph 
of Section 1, Seillier-Moiseiwitsch and Dawid (1993) have addressed this is- 
sue by using pi = En(Yi\Gi-i) under the null hypothesis Hq that assumes 
the sequence of events are generated by the probability measure II. This 
approach is related to the earlier work of Dawid (1982), who assumes a 
"subjective probability distribution" II for the events so that Bayesian fore- 
casts are given by pi = iT{ = E'nQ'i Letting ^ = 1 or according to 
whether time t is included in the "test set" to evaluate forecasts, he calls the 
test set "admissible" if £t depends only on Qt-i, and uses martingale theory 
to show that 

(n n \ / n ( n 

Y,^ Y i - X>& / X> — ► a - s - m on I = 00 
i=l i=l / ' i=l I i=l 

From (3.6), it follows that for any < x < 1, the long-run average of Y{ 
(under the subjective probability measure) associated with Pi = x (i.e., & = 
L{p i=x y) is equal to x provided that YH=i^{pi=x} ~^ °°- Note that Dawid's 
well-calibration theorem (3.6) involves the subjective probability measure 
II. DeGroot and Fienberg (1983) have noted that well-calibrated forecasts 
need not reflect the forecaster's "honest subjective probabilities," that is, 
need not satisfy Dawid's coherence criterion pi = iXi. They therefore use a 
criterion called "refinement" to compare well-calibrated forecasts. 

In this paper we apply Theorem 1 to construct confidence intervals for L n , 
under the actual probability measure P that generates the unknown pi in 
(3.5). Whereas substituting pi by Yi in L(pi,pi) leads to a consistent estimate 
of L n when L is linear, such substitution gives as an overly optimistic 
estimate of pi(l — Pi) = Var(Yi| Fi^i). A conservative confidence interval for 
L n can be obtained by replacing pi(l — pi) in (3.5) by its upper bound 1/4. In 
Section 3.4, we consider estimation of a\ and of n _1 ^27=1 L(pi,pi) when L is 
nonlinear in pi, under additional assumptions on how the pi are generated. 
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3.2. Application to comparison of probability forecasts. Consider two se- 
quences of probability forecasts p' = (£[,... ,p' n ) and p" = (p", . . . ,p") of 
p = (pi, . . . ,p n ). Suppose a loss function L(p,q) is used to evaluate each 
forecast, and let L(p,q) be its linear equivalent. Since L(p,q) — L(p,q) does 
not depend on q in view of (2.2), it is a function only of p, which we denote 
by d{p). Hence 

Lipi^) - L(p h p'l) = {L(pi,p^ + d(pi)} - {L(p u p'!) + d(pi)} 

= L(pi,p'i) - L(pi,p") 

is a linear function of pi, and therefore we can estimate A n = n~ l ^2™ = i{L(pi, 
- L(pi,p'{)} by the difference n" 1 Yh=i L ( Y i>Pi) ~ n ~ X Yh=i L { Y hP'i) of 
scores of the two forecasts. Application of Theorem 1 then yields the follow- 
ing theorem, whose part (ii) is related to (2.4). 

Theorem 2. Let A n = n^ 1 Yli=i{ L ( Y hP'i) ~ H Y i,Pi)} and 
Si = - L(0,$)} - {L(l,p'{) - L(0,$')}, 

(3.7) 

n 

sl = n~ 1 J2 6 iPi( 1 -Pi)- 

i=l 

(i) Suppose L has a linear equivalent. Letting A n = n^ 1 X^ii-^fe'Pi) — 
L(pi,p'-)}, assume that s^ = 0(1) with probability 1. Then A n — A n con- 
verges to with probability 1. If furthermore s n converges in probability to 
some nonrandom positive constant, then y/n{A n — A n )/s n has a limiting 
standard normal distribution. 

(ii) Without assuming that L has a linear equivalent, the same conclusion 
as in (i) still holds with A n = n~ l Y17=i{^iPi + ^i^iP'i) ~ L{Q,p'l)}- 

3.3. Illustrative applications and skill scores. As an illustration of The- 
orem 2, we compare the Brier scores for the fe-day ahead forecasts 

(k) 

p t , 1 < k < 7, for Queens, NY, provided by US National Weather Service 
from June 8, 2007, to March 31, 2009. Table 1 gives the values of B\ and 
Bk — Bk_i for 2 < k < 6. Using 1/4 to replace pi(l — pi) in (3.7), we can use 
Theorem 2(i) to construct conservative 95% confidence intervals for 

{n n ~| 

t=i t=i j 

in which pt is the actual probability of precipitation on day t. These confi- 
dence intervals, which are centered at B^ — B^-i, are given in Table 1. The 
results show significant improvements, by shortening the lead time by one 
day, in forecasting precipitation k = 2, 3, 4, 6. 
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Table 1 





Brier 


scores B\ 


and 95% confidence 


intervals for 


A(fe) 




Bi 


A(2) 


A(3) 


A (4) 


A(5) 


A(6) 


A(7) 


0.125 


0.021 


0.012 


0.020 


0.010 


0.015 


0.007 




±0.010 


±0.011 


±0.012 


±0.011 


±0.011 


±0.010 



For another application of Theorem 2, we consider Winkler's (1994) skill 
score. To evaluate weather forecasts, a skill score that is commonly used 
is the percentage improvement in average score over that provided by cli- 
matology, denoted by p\ and considered as an "unskilled" forecaster, that 
is, 

{n n \ n 

n- 1 m,Pt) - n- 1 L (Xi,Pi) \ /n" 1 £ L(Y u tf). 
i=l i=l J i=l 

Climatology refers to the historic relative frequency, also called the base 
rate, of precipitation; we can take it to be f>1 = (M + Ym,=-m Noting 
that (3.8) is not a proper score although it is intuitively appealing, Winkler 
(1994) proposed to replace the average climatology score in the denominator 
of (3.8) by individual weights l(pi,pf), that is, 

n 

(3.9) W n = n" 1 - L{Y h pf)} /l( Pi ,pt), 

i=i 

where l(p,c) = {L(l,p) - L(l, c)}/ {p > c} + {L(0,p) - L(0, c)}I {p<c} . Theo- 
rem 2(i) can be readily extended to show that Winkler's score W n is a 
consistent estimate of 

n 

(3.10) w n = n- 1 Y,{L(pi,Pi) ~ L(jH,ffl}/l(pi,0i) 

i=i 

and that y/n(W n — w n )/s n has a limiting standard normal distribution, 
where 

n 

(3.11) S 2 n = n- 1 ^iPi^-Pi)/l 2 (Pi,M)- 

i=l 

Winkler (1994) used the score (3.9), in which L(p,p) = (p — p) 2 , to evaluate 
precipitation probability forecasts, with a 12- to 24-hour lead time, given by 
the US National Weather Service for 20 cities in the period between April 
1966 and September 1983. Besides the score (3.9), he also computed the 
Brier score and the skill score (3.8) of these forecasts and found that both 
the Brier and skill scores have high correlations (0.87 and 0.76) whereas 
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Climatology 



0.2 0.3 0.4 

Climatology 



Fig. 1. The Winkler and skill scores versus climatology. 



(3.9) has a much lower correlation 0.44 with average climatology, suggesting 
that (3.9) provides a better reflection of the "skill" of the forecasts over 
an unskilled forecasting rule (based on historic relative frequency). Instead 
of using correlation coefficients, we performed a more detailed analysis of 
Winkler's and skill scores to evaluate the one-day ahead probability forecasts 
of precipitation for six cities: Las Vegas, NV; Phoenix, AZ; Albuquerque, 
NM; Queens, NY; Boston, MA; and Portland, OR (listed in increasing order 
of relative frequency of precipitation), during the period January 1, 2005, 
to December 31, 2009. The period January 1, 2002, to December 31, 2004, 
is used to obtain the past three years' climatology, which is used as the 
reference unskilled score in the calculation of the skill score and Winkler's 
score (3.9). The left panel of Figure 1 plots Winkler's score against the 
relative precipitation frequency taken from the period January 1, 2005, to 
December 31, 2009, which is simply the percentage of days with rain during 
that period and represents the climatology in (3.8). The dashed line in the 
right panel of Figure 1 represents linear regression of the skill scores (3.8) 
on climatology and has a markedly positive slope of 0.95. In contrast, the 
regression line of Winkler's scores on climatology, shown in the left panel of 
Figure 1, is relatively flat and has slope 0.12. Unlike skill scores, Winkler's 
scores are proper and provide consistent estimates of the average loss (3.10) 
involving the actual daily precipitation probabilities pi for each city during 
the evaluation period. The vertical bar centered at the dot (representing 
Winkler's score) for each city is a 95% confidence interval for (3.10), using a 
conservative estimate of (3.11) that replaces pi(l —pi) by 1/4. The confidence 
intervals are considerably longer for cities whose relative frequencies pi of 
precipitation fall below 0.1 because 5f /l 2 (pi,p1) tends to be substantially 
larger when p^ is small. 
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3.4. Risk buckets and quadratic loss functions. Both (3.5) and (3.7) in- 
volve pi(l — pi), which is the variance of the Bernoulli random variable YJ. It 
is not possible to estimate this variance based on a single observation unless 
there is some statistical structure on the pi to make (3.5) or (3.7) estimable, 
and a conservative approach in the absence of such structure is to use the 
upper bound 1/4 for pi(l — Pi) in (3.5) or (3.7), as noted in Section 3.1. 
One such structure is that the pi can be grouped into buckets within which 
they have the same value, as in risk assessment of a bank's retail loans (e.g., 
mortgages, automobile loans and personal loans), for which the obligors are 
grouped into risk buckets within which they can be regarded as having the 
same risk (or more precisely, the same probability of default on their loans). 
According to the Basel Committee on Banking Supervision [(2006), page 91] 
each bank has to use at least seven risk buckets for borrowers who have not 
defaulted and at least one for those who have defaulted previously at the 
time of loan application. 

A bucket model for risk assessment involves multivariate forecasts for 
events k, 1 < k < Kt, at a given time t. Thus, identifying the index i with 
(t, k), one has a vector of probability forecasts (pt,i, ■ ■ ■ ,Pt,K t ) at time t — 1 
for the occurrences of Kt events at time t; Kt = if no forecast is made at 
time t — 1. The information set can then be expressed as Gt-i that consists 
of event and forecast histories and other covariates up to time t — 1, and 
therefore conditional on Qt-i and pt,i, ■ ■ ■ ,Pt,K t > the events at time t can be 
regarded as the outcomes of K% independent Bernoulli trials with respective 
probabilities pn, ■ ■ ■ ,Pt,K t - The bucket model assumes that, conditional on 
Qt-i and pt i, . . . ,ptK t , events in the same bucket at time t have the same 
probability of occurrence. That is, the pt,k are equal for all k belonging to 
the same bucket. Let Jt be the number of buckets at time t and rijt be 
the size of the jth bucket, 1 < j ' < Jt, so that n = J2t=i ^2j=i n j,t- Then the 
common p t ^ of the jth bucket at time t, denoted by pt{j), can be estimated 
by the relative frequency Y t (j) = n~\ Y2iel- 1 w here Ijj denotes the index 
set for the bucket. This in turn yields an unbiased estimate 

(3.12) v t (j) = n^YtUK 1 ~ Y t (j))/(n jtt - 1) 

of pi(l — Pi) for i G Ijj, and we can replace pi(l — Pi) in (3.5) or (3.7) by 
vt(j) for i G Ij t so that the results of Theorems 1 or 2 still hold with these 
estimates of the asymptotic variance, as shown in the following. 

Theorem 3. Using the same notation as in the preceding paragraph, 
suppose nj t t > 2 for 1 < j < J t and define vt(j) by (3.12). 

(i) Under the same assumptions as in Theorem 1, define 

T Jt 

& ' = n ~ l EE E - mp t )} 2 v t (j). 

t=i j=i iei jft 
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Then — a\ converges to with probability 1. 

(ii) Under the same assumptions as in Theorem 2, — s^ converges to 
with probability 1, where s 2 = n -1 Yn=i ^2j=i ^ieij t 

Proof. Let Tt-\ be the cr-field generated by Qt-i and p Sj x, . . . ,p s ,K a for 
s < t. Note that v t (j) = Eie/^C^ - W)? 7 '(n jt t - 1) and that 

(3.13) E(v t (j)\T t -i) =Pt(m-Pt(j)), 

which is the variance of Yi associated with Ij f. Therefore 

EWVft(j1(i-ft(jl)}{ E l L M) - WM 3 } 

3=1 W** J 

is a martingale difference sequence with respect to {Ft}- Hence we can apply 
the martingale strong law as in the proof of Theorem 1 to show that a 2 — o~ 2 
converges a.s., and the same argument also applies to s 2 n — s 2 n . □ 

The preceding proof also shows that for the squared error loss L(p,p) = 
(p ~ p) 2 i we can estimate (3.1) in the bucket model by the adjusted Brier 
score 

T J t 

(3-14) Ln-n-^^n^Mj), 

t=l j=i 

since L n = n~ 1 ^2 r l =1 L(Yi,pi) is a consistent estimate of the linear equiva- 
lent n~ l YJi=i{Pi ~ 2 PiPi +Pi), and J2t=i J2jLi n j,tVt{j) is a consistent 
estimate of n~ /2i=iPiQ — Pi)- Consistency of an estimate l n of l n means 
that l n — l n converges to in probability as n — > oo. Moreover, the following 
theorem shows that y/n{L n — n~ l Y^t=i 2~2j=i n j,tVt{j) — L n ) has a limiting 
normal distribution in the bucket model and can be studentized to give a 
limiting standard normal distribution. Its proof is given in Section 5. 

Theorem 4. Suppose n^ t > 2 for 1 < j < J t . Letting L(p,p) = (p - 
p) 2 , define L n by (3.1) and the adjusted Brier score by (3.14)- Let vt{j) = 
Pt{j)0- -Pt{j)), 

T J 

/«=«- 1 EE{^)E( 1 -2ft) 2 

t=i j=i ^ ieij, t 

(3.15) - 2v t (j)(l - 2p t (j)) (1 " 2ft) 

ieij, t 

+ n jtt v t (j)(l - 4v t (j)) + 2n ht v 2 (j)/(n jtt -1)1. 



EVALUATING PROBABILITY FORECASTS 



13 



If f3 n converges in probability to some nonrandom positive constant, then 
y/n{L n — n" 1 Ylt=i Sj=i n j,t v t{j) — Ln)/ Pn has a limiting standard normal 
distribution. Moreover, if nj > 3 for all 1 < j < Jt, then f3 n — f3 n converges 
to with probability 1, where 



t=i j=i < ien t 



2r? 2 



K* - 1) 3 



(3.16) 



+ 



An jy t{n jy t - 1) 



(n jt t ~ 2) 2 



3.5. Quasi-buckets and reliability diagrams. When the actual pt,i in a 
bin with index set Ijt are not the same for all i E Ijt, we call the bin a 
"quasi-bucket." These quasi-buckets are the basic components of reliability 
diagrams that are widely used as graphical tools to evaluate probability fore- 
casts. In his description of reliability diagrams, Wilks [(2005), Sections 7.1.2, 
7.1.3] notes that reliability, or calibration, relates the forecast to the aver- 
age observation, "for specific values of (i.e., conditional on) the forecast." A 
widely used approach to "verification" of forecasts in meteorology is to group 
the forecasts pi into bins so that "they are rounded operationally to a finite 
set of values," denoted by p(l), ■ ■ ■ ,p(J)- Corresponding to each p(j) is a set 
of observations Y^i G Ij, taking the values and 1, where Ij = {i :pi =p(j)}- 
The reliability diagram plots Y(j) = (J2i<=i ^i)/ n j versus p(j), where rij is 
the size of Ij; see Figure 3 in Section 4. Statistical inference for reliability 
diagrams has been developed in the meteorology literature under the as- 
sumption of "independence and stationarity," that is, that (pj,Yj) are i.i.d. 
samples from a bivariate distribution of forecast and observation; see Wilks 
[(2005), Section 7.9.3] and Brocker and Smith (2007). Under this assump- 
tion, the index sets Ij define a bucket model and a (1 — a)-level confidence 
interval for the common mean p(j) of the Y{ for % G Ij is 



(3.17) 



- a/ 2{Y(m 

where z q is the qih quantile of the standard normal distribution. 

The assumption of i.i.d. forecast-observation pairs is clearly violated in 
weather forecasting, and this has led to the concern that the confidence 
intervals given by (3.17) "are possibly too narrow" [Wilks (2005), page 331]. 
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The temporal dependence between the forecast-observation pairs can be 
handled by incorporating time as in Section 3.4. To be specific, let Pt,k,k < 
Kt, be the probability forecasts, at time t — 1, of events in the next period. 
We divide the set {ftt,k ■ k < K t , 1 < t < T} into bins B\, . . . , Bj, which are 
typically disjoint sub-intervals of [0,1]. Let 

Ij,t = {k-Pt,k ^Bj}, 

(3-18) Y t (j) = E Yi/ nj , t , 

t=i ieij^ 

where rijt is the cardinality of Ijt and rij = Ylt=i n j,t- Note that rijj an d 
Yt(J) have already been introduced in Section 3.4 and that Y(j) is the 
average of the observations in the jth bin, as in (3.17). In the absence of any 
assumption on p$ for i G Ij )t , these index sets define quasi-buckets instead of 
buckets. We can extend the arguments of Section 3.4 to the general case that 
makes no assumptions on the pi and thereby derive the statistical properties 
of Y(j) without the restrictive assumption of i.i.d. (pi,Yi). With the same 
notation as in Section 3.4, note that the index sets Ij t defined in (3.18) are 
Gt-x -measurable since the pt t k are ^_i-measurable. 

Whereas Yt(j) is used to estimate the common value of pi for i G Ijj and 
vt(j), defined in (3.12), is used to estimate the common value of pi(l — p-i) 
in Section 3.4, the pi in quasi-buckets need no longer be equal. Replacing Yj 
by pi in Y(j) and taking a weighted average of vt(j) over t, we obtain 

foirt -i -\ ^t=i^ieij, t pi ^ / -\ Tj=i n j,tvt(j) 

(3.19) p(j) = 3 - — , v(j)= . 

rij nj 

Instead of (3.17) that is based on overly strong assumptions, we propose to 
use 

(3-20) Y(j)±z 1 _ a/2 {v(j)/n j } 1 / 2 

as a (1 — a)-level confidence interval for p(J). Part (iii) of the following 
theorem, whose proof is given in Section 5, shows that the confidence interval 
tends to be conservative. Parts (i) and (ii) modify the estimates in Theorem 
3 for a\ and s\ when the pi in the assumed buckets turn out to be unequal. 



Theorem 5. With the same notation as in Theorem 3, remove the 
assumption that pi are all equal for i G Ij )t but assume that Ijj is Gt-l- 
measurable for 1 < j < Jt . 
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(i) Under the assumptions of Theorem 1, let 
, T J t 

(3.2i) = - EE E w,Pi) - l(o,k)} 2 (^ - ^)) 2 -^rr- 



t=i j=i iex 



TTien cr^ > <r^ + o(l) a.s. Moreover, if the pi are equal for all i G Ijt and 
1 < j < J t , then a\ — a\ converges to a. s. 

(ii) Under the assumptions of Theorem 2, s\ > + o(l) a.s., where 

(3.22) 3=-EE E ^^-™) 2 ^t- 

t=i j=i ie / j t 

(iii) Suppose Jt = J for all t = 1, . . . ,T. For 1 < j < J , define Y(J) by 
(3.18), and p(j) and v(j) by (3.19), in which rij = Ylt=i n j,t- Let 

T 

(3.23) v(j)=nj 1 ^2^2 Pi (l-p i ), 

t=i ieij tt 

and let n = n\ + ■ • ■ + rij be the total sample size. Suppose rij/n and v{j) 
converge in probability to nonrandom positive constants as n — > oo. Then 
(nj/v{j)) l l 2 {Y(j) — p(j)} has a limiting standard normal distribution as 
n—> oo. Moreover, v(j) > v(j) + o p (l) and equality holds if the pi are equal 
for all i 6 Ij f. 

Note that the numerator of (3.12) is equal to Eiei- Q^i ~ Yt{j)) 2 ■ The 
estimate (3.21) or (3.22) essentially replaces this sum by a weighted sum, 
using the weights associated with pi(l — pi) in the sum (3.5) or (3.7) that 
defines a 2 or s 2 n . The term njj/irijj — 1) in (3.21) and (3.22) corresponds to 
the bias correction factor in the sample variance (3.12). Theorem 5 shows 
that (3.21) [or (3.22)] is still a consistent estimator of a 2 (or s 2 ^) if the bucket 
model holds, and that it tends to over-estimate a 2 (or s^) otherwise, erring 
only on the conservative side. 



4. Simulation studies. The risk buckets in Section 3.4 and the forecasts 
are usually based on covariates. In this section we consider T = 2 in the case 
of discrete covariates so that there are Jt buckets of various sizes for n = 
E 2 =i E/= i n j,t — 300 probability forecasts prior to observing the indicator 
variables Y\,...,Y n of the events. We use the Brier score and its associated 
loss function L(p,p) = (p — p) 2 to evaluate the probability forecasts and 
study by simulations the adequacy of the estimates $ 2 and s^ and their use 
in the normal approximations. The simulation study covers four scenarios 
and involves 1,000 simulation runs for each scenario. Scenario 1 considers the 
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Brier score of a forecasting rule, while Scenarios 2-4 consider the difference 
of Brier scores of two forecasts. The bucket sizes and how the pi and pi are 
generated in each scenario are described as follows. 

Scenario 1. There are ten buckets of size 15 each for each period. 
The common values pt{j) in the buckets are 0.1, 0.25, 0.3, 0.35, 0.4, 0.5, 
0.65, 0.7, 0.75 and 0.8, respectively, for t = 1,2. The probability forecast 
Ptki 1 <■ k < 150, made at time t — 1, uses covariate information to identify 
the bucket j associated with the A;th event at time t and predicts that it 
occurs with probability Yj_i(j), assuming that 150 indicator variables at 
time are also observed so that Yo(j) is available. 

Scenario 2. For each period, there are nine buckets, three of which 
have size 2 and two of which have size 5; the other bucket sizes are 24, 30, 
35 and 45 (one bucket for each size). The bucket probabilities pt(j) are i.i.d. 
random variables generated from Uniform (0,1). The forecast p t ^ is the same 
as that in Scenario 1, and there is another forecast p' t ^ = Yt-i that ignores 
covariate information. 

Scenario 3. For each period, there are five buckets of size 30 each, 
and pt(j) = —0.1 +j/5 for j = 1, . . . , 5. The two forecasts are the same as in 
Scenario 2. 

Scenario 4. This is the same as Scenario 3, except that pi is uniformly 
distributed on [(j — l)/5,j/5] for % 6 Ijj, that is, the bucket assumption is 
only approximately correct. 

Figure 2 gives the Q-Q plots of ^/n(L n -n~ l Y?t=i Y,f=\ n j,tVt{j) - L n )/j3 n 
for Scenario 1 and \/n(A n — A n )/s n for Scenarios 2-4. Despite the deviation 
from the assumed bucket model in Scenario 4, the Q-Q plot does not deviate 
much from the 45° line. Table 2 gives the means and 5-number summaries 
(minimum, maximum, median, 1st and 3rd quartiles) of s n /s n for Scenarios 
2-4 and f3 n /Pn for Scenario 1. 

To illustrate the reliability diagram and the associated confidence intervals 
(3.20) in Section 3.5, we use one of the simulated data sets in Scenario 4 to 
construct the reliability diagram for the forecasts fit & (t = 1,2; k = 1, . . . , 5), 
grouping the forecasts over time in the bins [(j — l)/5,j/5],j = 1,...,5, 
that are natural for this scenario. The diagram is given in Figure 3. Table 
3 gives the means, standard deviations (SD), and 5-number summaries of 
Y(j),p(j),v(j) and v(j) defined in (3.18), (3.19) and (3.23) based on the 
1,000 simulations. In particular, it shows that v(j) tends to over-estimate 
v(j). Moreover, the probability of coverage of the 95% interval (3.20) for 
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Scenario 1 Scenario 2 




-3-2-10123 -3-2-10123 
Normal Quantiles Normal Quantiles 



Fig. 2. Q-Q plots for Scenarios 1~4- 
Table 2 

Simulation results for p n /f3 n (Scenario 1) and s n /s 





Min. 


1st qrt. 


Median 


3rd qrt. 


Max. 


Mean 


Scenario 1 


0.6397 


1.0840 


1.1810 


1.2830 


1.6520 


1.1780 


Scenario 2 


0.7442 


0.9647 


1.0060 


1.0490 


1.1970 


1.0050 


Scenario 3 


0.7586 


0.9506 


1.0060 


1.0570 


1.2070 


1.0010 


Scenario 4 


0.7420 


0.9661 


1.0180 


1.0730 


1.2240 


1.0160 



p(j), evaluated by averaging over the 1,000 simulations, is 0.949, 0.947, 0.944, 
0.940 and 0.928, for j = 1, . . . , 5, respectively, suggesting that the results of 
Theorem 5 still apply even for moderate sample sizes. We do not consider 
the second forecast p' tk = Yt-i to illustrate reliability diagrams because by 
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Fig. 3. Reliability diagram for the forecasts pt,k- At the midpoint of each of the five bins 
[U ~ 1) /5, j/5], j = 1, ... ,5, a 95% confidence interval, centered at Y(j), forp(j) is shown; 
only the upper (or lower) half of the interval is shown at j = 1 (or 5) to keep the range of 
the vertical axis between 0.1 and 0.9. 

Table 3 







Simulation 


results for 


p(j),Y(j),v 


(j) and v(j) 








Min 


1st qrt. 


Median 


3rd qrt. 


Max 


Mean 


SD 


p(l) 


0.101 


0.101 


0.101 


0.168 


0.234 


0.121 


0.033 


y(i) 


0.017 


0.083 


0.117 


0.156 


0.350 


0.123 


0.051 


v(l) 


0.087 


0.087 


0.087 


0.127 


0.167 


0.100 


0.020 


0(1) 


0.067 


0.089 


0.106 


0.132 


0.233 


0.106 


0.037 


m 


0.101 


0.300 


0.300 


0.355 


0.515 


0.320 


0.049 


y(2) 


0.050 


0.267 


0.317 


0.378 


0.633 


0.319 


0.089 


v(2) 


0.087 


0.207 


0.207 


0.207 


0.247 


0.209 


0.015 


0(2) 


0.048 


0.208 


0.221 


0.239 


0.259 


0.213 


0.034 


P(3) 


0.300 


0.515 


0.515 


0.577 


0.701 


0.527 


0.058 


Y(3) 


0.217 


0.467 


0.533 


0.589 


0.833 


0.529 


0.096 


v(3) 


0.206 


0.233 


0.247 


0.247 


0.247 


0.239 


0.011 


0(3) 


0.144 


0.240 


0.249 


0.254 


0.259 


0.244 


0.015 


p(4) 


0.515 


0.659 


0.701 


0.701 


0.906 


0.690 


0.052 


y(4) 


0.367 


0.633 


0.689 


0.750 


1.000 


0.687 


0.090 


«(4) 


0.082 


0.206 


0.206 


0.206 


0.247 


0.204 


0.021 


0(4) 


0.063 


0.207 


0.217 


0.236 


0.259 


0.211 


0.035 


f>(5) 


0.769 


0.906 


0.906 


0.906 


0.906 


0.895 


0.026 


y(5) 


0.733 


0.867 


0.900 


0.933 


1.000 


0.892 


0.049 


«(5) 


0.082 


0.082 


0.082 


0.082 


0.164 


0.088 


0.016 


0(5) 


0.077 


0.084 


0.093 


0.120 


0.202 


0.096 


0.037 



the central limit theorem, the p' t k are concentrated around 0.5 and nearly 
all of the forecasts lie in the bin [0.4,0.6]. 
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5. Proofs of Theorems 4 and 5. Re-labeling the Yi as Y^i, . . . , Y tt K t , we 

note that conditional on Tt-x , {Yt k '■ 1 — ^ — Kt} is a set of independent 
Bernoulli random variables with respective parameters Ptx, ■ ■ ■ ,Pt K t - This 
point, which has been noted in the second paragraph of Section 3.4 and 
will be discussed further in Section 7, explains why we can first derive the 
result for the special case in which Y t ^ are independent and then modify the 
argument by conditioning on Tt—x and appealing to martingale theory. As 
an illustration, note that if Ijt, are i.i.d. Bernoulli random variables 

with common parameter pt(J), then vt{j) defined in (3.12) is an unbiased 
estimate of pt(j)0- — Pt{j)) and one can use the classical strong law of large 
numbers to derive the result. The proof of Theorem 3 basically shows that 
vt(j) is "conditionally unbiased" given Tt-x in the sense of (3.13) and then 
uses the martingale strong law to derive the result. To prove Theorem 4, we 
extend this idea to obtain a conditionally unbiased estimate of Var(-0t(j)) 
by first considering the i.i.d. case: let X±, . . . ,X m be i.i.d. random variables. 
As is well known, the sample variance v = E2=i(^« — X) 2 /(in — 1) is a U- 
statistic of order 2, with kernel h(Xi,Xk) = (Xj — Xk) 2 /2. Arvesen (1969) 
has shown that an unbiased estimate of the variance of the U -statistic is the 
jackknife estimate 

a i -i \ m ( 1 m } 2 

<m) 4^i;{-^i;ww-<> ■ 

m(m — 2r \ m — 1 
y ' i=x I k=x J 

Proof of Theorem 4. Use L(p,p) = (p — p) 2 to express n{L n — n~ x x 
ELi E/li n J,Mi) ~ L n} as 

T K t T J t 

( 5 - 2 ) EE^ 1 - 2 M(Y t , k - Pt ,k) -J2J2 n ^t(j)-Pt(j)(i-Pt(m, 

t=X k=X t=X j=X 

which is the difference of two martingales and is therefore a martingale. To 
compute the conditional variance (or predictable variation) of (5.2), we can 
use the "angle bracket" notation and formulas for predictable variation and 
covariation [Williams (1991), Section 12.12] to obtain 

(EE^- 2 ^)^-^)) 
\i=ifc=i / 

T K t 

( 5 - 3 ) = EE^ 1 - 2 M 2 E((Y t ,k -??t,fc) 2 PVi) 

t=x k=l 
T J, 



t=i j= x \eij, t ' 
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(5.4) 



I T K t T J t ' 

EE^ 1 - 2p t)fe )(F t)fc -p t)fe ),^x^ n ^^^') -p*0')(i -ptO'))] 

U=l fe=l t=l j=l i 



T J t 



EE Ed-** 

i=i j=i L ie/j, t 



Pt(j)(l-PtC/'))(l-2pt(j')), 



\*=ii=i / 
( 5 - 5 ) =EEk^0')(! -pt(j))[i - 4 Pt (i)(i - pt(m 

t=l j=l 

+ 2n j , t p 2 (j)(l-p t (j)) 2 /(n J , t -l)}. 

Combining (5.3), (5.4) and (5.5) yields formula (3.15) for the conditional 
variance of (5.2) divided by n. 

In view of (3.13), vt(j) is a conditionally unbiased estimate of Pt(j)(l — 
Pt(j)) given Tt-i- If ^ ~ Bernoulli (pi), then E(Y — p) 3 = p(l — p)(l — 2p). 
Hence a conditionally unbiased estimate of Pt(i)(l —pt{j))0- — 2pt(j)) given 
T t -i is 

(5-6) KVK t -l) 3 ] ^(Yi-YtU)) 3 , 

analogous to (3.13). Replacing p t {j)(l - 2pt(j)) in (5-4) by (5.6) 

and multiplying (5.4) by — 2/n gives the second summand of (3.16). Note 
that the first summand of (3.16) corresponds to replacing pt(j)(l — Pt(j)) 
in (5.3) by Vt(J)- The last summand of (3.16) corresponds to using the 
jackknife estimate (5.1) to estimate the conditional variance of vt(j) given 
J-t—i- Since {Yi,i G Ijj} is a set of i.i.d. random variables conditional on 
Tt-i, the jackknife estimate is conditionally unbiased given Tt-\\ see the 
paragraph preceding the proof of this theorem. The rest of the argument is 
similar to that of Theorem 3. □ 



Proof of Theorem 5. We first prove (iii). Using the notation in the 
paragraph preceding the proof of Theorem 4, recall that conditional on J-t—i, 
the Yt t k are independent Bernoulli (pt,k) random variables. Since Ij t t is Tt-\- 
measurable, it follows that Ylieij t (Xi ~Pi) i s a martingale difference sequence 
with respect to {F t } and E{ [£ iG/j . t (Yi - Pi)] 2 ^^} = T. i& i^ t Pi( l ~Pi)- 
Since n~ l Ylt=i T^i&ij t ~ P^ converges in probability to a nonrandom 
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positive constant as n — > oo, we can apply the martingale central limit the- 
orem as in the proof of Theorem 1 to conclude that 



{E?=iE ie /^(i - Pi)Y 12 

proving the first part of (iii). 

To prove the second part of (iii), and also (i) and (ii), we first show that 
for any nonnegative J-t—i -measurable random variables Wti, ■ ■ ■ ,Wt,K t i 

e{ w t)k { Y t,k-yt{j)) 2 \^t-i\ 

(5.7) 

> (1 - n^)wt,kPt,k{ 1 ~ Pt,k), 
keij, t 

in which EfeeJ, t means EieTj t wnen * * s represented as (t, k); see the second 
paragraph of Section 3.4. Define Y t (j),Y(j) and p(j) as in (3.18) and (3.19), 
and let pt(j) = (Efee/, ■ t Pt,k)/ n j,t- From the decomposition 

(5.8) Y t>k - Y t (j) = (Y tyk - Ptyk ) + (p t ,k-Pt(j)) + (Pitf) - Y t (j)), 
it follows that the left-hand side of (5.7) is equal to 

^2 w tik E[(Y t>k - Pt,kf\^t-i] + ^ wt,k(pt,k ~ PtU)) 2 

k&Ij,t k€lj,t 

(5.9) + w^kE[{Y t {j)-p t {j)f\F t ^] 

k£lj,t 

-2E (Y t (j)-p t (j)) J2 ™t,k{ Y t,k~Pt,k) 
keij, t 

by using the fact that conditional on T%—\ the Y^ k are independent Bernoulli. 
Since Y t (j) - p t (j) = Ylk^i t ^( Y t,k - Pt,k)/nj,t, we can use this fact again to 
combine the last two terms of (5.9) into 

(5-10) - J2(wt,k/n jjt )E[(Y t!k -p t>k ) 2 \T t -i}. 

keij, t 

Since Wt k > 0, we can drop the second term in (5.9) to obtain (5.7) from 
(5.9) and (5.10). Moreover, since this term is actually when the pt k are 
all equal for k £ Ij t t, equality holds in (5.7) in this case. 
Let w tj k = n j,i/( n j,t ~~ !)• Then (5.7) reduces to 

(5.11) E(n jtt vtU)\^t-i) > Yl PtA l -Pt,k)- 



fee/ 
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Under the assumptions of part (iii) of the theorem, we can apply the mar- 
tingale strong law to obtain 

T 

(5.12) y~){nj tt vt(j) ~ EinjjVtU^Ft-i^/rij — > a.s. on {nj -> oo}. 
t=i 

Combining (5.11) with (5.12) yields v(j) > v(j) +o p (l), with equality when 
the p t: k are all equal for k 6 Ij } f 

To prove part (i) of the theorem, put w t ^ = {L(l,fit,k) — L(0,pt t k)} 2n j,t/ 
(fij t — 1) in (5.7) and then use the same argument as in the preceding 
paragraph. The proof of part (ii) is similar. □ 

6. Extensions and connections to forecast comparison in econometrics. 

Our new approach to evaluating probability forecasts in Section 3 is based on 
consistent and asymptotically normal estimates of the average loss 
n J27=i L{Pi,fti), without any assumptions on how the observed indicator 
variables Yi and their forecasts pi are generated. The key to this approach is 
that conditional on J-i-%, Yi is Bernoulli (pi), and therefore martingale argu- 
ments can be used to derive the results in Section 3. In Section 6.1 we show 
how this approach can be extended to more general random variables Yi . As 
shown in (2.3), when Yi is an indicator variable, the conditional expectation 
of the score L(Yi,pi) given J~i-\ is a linear function of pi, but this does not 
extend to more general random variables Yi. In Section 6.2 we review the 
recent econometrics literature on testing the equality of the expected scores 
of two forecasts and discuss an alternative approach to statistical inference 
on the expected difference in average scores of two forecasts. 

6.1. Extensions to general predictands. A characteristic of (pi, Yi) in prob- 
ability forecasting is that E(Yi\J r i^i) = pi while the <5j_i-measurable forecast 
pi is an estimate of pi . The theorems in Section 3 and their martingale proofs 
in Section 5 can be easily extended to general random variables Yi when the 
loss function is of the form L(fii,fii), where //j = E{Yi\Fi-\) and fii is a 
forecast of Yi given Although Yi\Fi~\ ~ Bernoulli(pj) in Section 3, no 

parametric assumptions are actually needed when we use a loss function of 
the form L(//,,/ij). As in (2.2), such loss function is said to have a linear 
equivalent L if 

(6.1) L(y,y) is linear in y and L(y,y) — L(y,y) does not depend on y. 

The bucket model in Section 3.4 can be extended so that Yt^l^t-i have the 
same mean and variance for all (t, k) belonging to the same bucket. In place 
of (3.12), we now use 

(6.2) m= J2(Y i -Y t (j)) 2 /(n j!t -l) 
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as an unbiased estimate of the common conditional variance of Y{ given T%—\ 
for i = (t, k) 6 Ij t, using the same notation as that in the proof of Theorem 
4. While the extension of Theorem 3 only needs the first two moments of 
Y t ,k\Ft-i to be equal for all (t,k) belonging to the same bucket, Theorem 4 
can also be extended by assuming the first four moments of Y t j^J-t-i to be 
equal for all (t,k) belonging to the same bucket, by using Arvesen's (1969) 
jackknife estimate of the variance of a [/-statistic. 

Clearly (5.8), (5.9) and (5.10) also hold with p t ^ and pt(j) replaced by 
fit,k and fit(k), so Theorem 5 can likewise be extended to quasi-buckets 
and reliability diagrams for the predicted means pitk- For sample means in 
the case of independent observations within each bucket, this extension of 
Theorem 5 can be viewed as a corollary of the analysis of variance. In fact, 
the proof of Theorem 5 uses martingale arguments and conditioning to allow 
dependent observations in each (quasi-)bucket. 

6.2. Inference on expected difference in average scores of two forecasts. 
When the Y{ are indicator variables of events, Theorem 2(H) establishes 

asymptotic normality for the difference A n = n~ l Yli=i{L{Yi)Pi) ~ ^O^itPi)} 
in average scores between two forecasts, from which one can perform infer- 
ence on 

n 

(6,) >=; 

=n- i Y,{Sim+mp'i)-mp")}, 
i=i 

where $ = {£(1,$) - £(0,#)} - {L(l,%) - L(0,#')}. This simplicity, how- 
ever, does not extend to general 3^. 

Proper scoring rules for probability forecasts of categorical and contin- 
uous variables Y{ have been an active area of research; see the review by 
Gneiting and Raftery (2007). Another active area of research is related to 
the extension of A n to general Yi in the econometrics literature, beginning 
with the seminal paper of Diebold and Mariano (1995). They consider the 
usual forecast errors ej := Yj — Y t \ t _i in time series analysis, where Y t \ t _i is 
the one-step ahead forecast of Y t based on observations up to time t — 1. 
Unlike a probability forecast that gives a predictive distribution of Yt as 
in Gneiting and Raftery (2007), is a nonprobabilistic forecast that 

predicts the value of Y t [see Wilks (2005), Section 7.3]. The score used by 
Diebold and Mariano (1995) is of the form L(Yt,Y t ui) = g(et), and they 
consider the average loss differential 

n 

(6-4) A n = n- 1 ^{L(y t ,y t ; t „ 1 )-L(y t ,^;„ 1 )} 

t=t 
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between two forecasts Y^ui and Y^L^, 1 <t <n. Assuming a probability 
measure Q under which dt '■= g(e't) ~ 9( e t) ^ s covariance stationary with 
absolutely summable autocovariances 7^ so that /(0) := YlT=-oo 7fc/(2vr) is 
the spectral density at frequency 0, they use the asymptotic normality of 
A n under the null hypothesis Hq: Eg(dt) = and a window estimate /(0) 
of /(0) so that the test statistic y / nA n /(27r/(0)) 1 / 2 has a limiting standard 
normal distribution under Hq as n— > 00. This aysmptotic normality result, 
however, requires additional assumptions, such as stationary mixing, which 
they do not mention explicitly. Their work has attracted immediate attention 
and spawned many subsequent developments in the econometrics literature 
on this topic. 

Giacomini and White (2006), hereafter abbreviated as G&W, review some 
of the developments and propose a refinement of Hq for which the asymptotic 
normality of A n can be established under precisely stated conditions that 
can also allow nonstationarity. They formulate the null hypothesis of equal 
predictive ability of two forecasting models or methods as "a problem of 
inference about conditional expectations of forecasts and forecast errors that 
nests the unconditional expectations that are the sole focus of the existing 
literature." The econometrics literature they refer to is primarily concerned 
with "forecast models;" thus Q in the previous paragraph is the probability 
measure associated with the forecast model being evaluated, or with a more 
general model than the two competing forecast models whose predictive 
abilities are compared. 

G&W evaluate not only the forecasting model but also the forecasting 
method, which includes "the forecasting model along with a number of 
choices," such as the estimation procedure and the window of past data, 
used to produce the forecast. They consider /c-step ahead forecasts, for which 
Y t i t _i is replaced by Y t \ t _ k , and assume that the forecasts are based on finite- 
memory models involving unknown parameters, that is, 

(6.5) Y t+k \ t = h(Y t , . . . ,y t _ m+ i,xt, . . .,x t _ m+ i;/3), 

where h is a known function, m is the order of the model, Xt is a covariate 
vector at time t and (3 is a parameter vector to be estimated from some 
specified window of past data. Their formulation generalizes that of West 
(1996) who considers regression models. Whereas West assumes that the 
data are actually generated by the regression model with true parameter 
(3*, G&W allow model misspecification, and therefore their assumptions 
do not involve (3*. They consider two such nominal models, resulting in 
the forecasts YL_^ and Y^ t _ k that use the same covariates but different 

estimates /3 t and /3 t . Their null hypothesis 

(6.6) H : E{L(Y t ,%_ k ) - L(Y t , Y^ h )\Q^ h } = a.s. Vt > 1 
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seems to be stronger than EL(Y t ,Y^ t _ k ) = EL(Y t ,Y" t _ k ) Vi considered by 
Diebold and Mariano (1995) for the case k = 1. On the other hand, (6.6) 
in the case k = 1 just says that L(Y t ,Y^ t _ 1 ) — L(Yt,Y^f t _ 1 ) is a martingale 

difference sequence under Hq so that the martingale central limit theorem 
can be applied to derive the limiting x 2 -distribution of G&W's test statistics 
under Hq. Unlike Diebold and Mariano (1995) for the case k = 1, G&W do 
not use test statistics of the form (6.4) and their test statistics involve more 
complicated weighted sums of L(Yt,YL_ k ) — L(Y t , Y^ t _ k ). These weights are 
chosen to improve the power of the test and require additional mixing and 
moment assumptions on (xj,lf) given in their Theorems 1-3. 

The methodology developed in Section 3 and its extension outlined in Sec- 
tion 6.1 suggest an alternative approach to comparing econometric forecasts. 
As in (6.4), we consider the average score difference 

n 

(6-7) A n = n- 1 ^{L(Y t ,Yl) - L(Y t , f/')}, 

i=l 

in which Y{ and Y(' are forecasts that are £/f_i-measurable. Since A:-step 
ahead forecasts of Yt are Qt-i -measurable for any k > 1, the theory applies 
to all fe-step ahead forecasts of It, as illustrated in Table 1. Instead of hy- 
pothesis testing, our approach is targeted toward estimating 

n 

(6.8) & n = n- l Y,E{L{Y t ,YD-L{Y t ,Yl')\Gt-i}, 

i=l 

in which E is with respect to the actual but unknown probability measure 
P. Analogously to Theorem 2(i) for the case of binary Yi, we can apply the 
martingale central limit theorem to establish the asymptotic normality of 
A n — A n . In many applications, one can make use of the bucket structure 
of the type in Section 3.4 to estimate the asymptotic variance of A n . In 
particular, this structure is inherent in dynamic panel data in econometrics 
and longitudinal data in epidemiology, which is beyond the scope of this 
paper on forecasting probabilities of events and will be treated elsewhere. 
Note that the bucket structure is only used in estimating the asymptotic 
variance of A n by (6.2), and that Theorem 5 and its extension outlined in 
Section 6.1 imply that the variance estimate tends to be conservative if the 
assumed bucket structure actually fails to hold. 

7. Discussion. The average score n _1 Y17=i L(Yi,pi) measures the diver- 
gence of the predicted probabilities pi, which lie between and 1, from the 
indicator variables Yi that can only have values or 1. As noted by Lich- 
tendahl and Winkler (2007), this tends to encourage more aggressive bets 
on the binary outcomes, rather than the forecaster's estimates of the event 
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probabilities. For example, an estimate of 95% probability may lead to a 
probability forecast of 100% for a higher reward associated with the indi- 
cator variable Yf, see also Mason (2008), who gives an example in which a 
forecaster is encouraged to give such "dishonest" forecasts. This difficulty 
would disappear if one uses L to compare pi with the actual pi, rather than 
with the Bernoulli (pi) random variable Y{. Because the pi are unknown, 
this is not feasible and the importance of using a proper score L(Yi,pi) to 
evaluate a probability forecast has been emphasized to address the issue 
of dishonest forecasts. In Section 3.2 we have shown that it is possible to 
use L(pi,pi) — L(pi,p' i ) for comparing two forecasters and to construct con- 
fidence intervals of the average loss difference. A key idea underlying this 
development is the linear equivalent of a loss function introduced in Sec- 
tion 2. Schervish (1989), Section 3, has used a framework of two-decision 
problems involving these loss functions to develop a method for comparing 
forecasters. Our approach that considers L(pi,p' i ) — L(pi,p'() can be regarded 
as a further step in this direction. 

As noted in Section 3.5, an important assumption underlying statistical 
inference in the verification of probability forecasts in meteorology is that 
the forecast-observation pairs are independent realizations from the joint dis- 
tribution of forecasts and observations. Although Mason [(2008), page 32] 
has pointed out that this assumption cannot hold "if the verification score 
is calculated using forecasts for different locations, or if both the forecasts 
and observations are not independent temporally," not much has been done 
to address this problem other than using moving-blocks bootstrap [Mason 
(2008), Wilks (2005)] because traditional statistical inference does not seem 
to provide much help in tackling more general forecast-observation pairs. 
The new approach in Section 3 can be used to resolve this difficulty. It uses 
martingale theory to allow the forecast-observation pairs to be generated 
by general stochastic systems, without the need to model the underlying 
system in carrying out the inference. The treatment of spatial dependence 
is also covered in Section 3.5, in which dependence of the events at Kt lo- 
cations at time t is encapsulated in the highly complex joint distribution 
of their generating probabilities pt,i, ■ ■ ■ ,Pt,K t , which our approach does not 
need to model in performing inference on forecast validation. Our viewpoint 
in forecast evaluation is that one should try not to make unnecessary or arbi- 
trary assumptions on the underlying data-generating mechanism, especially 
in regulatory settings such as regulatory supervision of a bank's internal rat- 
ings models of loan default probabilities; see Section 3.4 and Lai and Wong 
(2008). A convenient but incorrect data-generating model that is assumed 
can unduly bias the comparison. 
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